Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ssh: Support specifying identity file via environment variable #3149

Merged
merged 4 commits into from Feb 15, 2019

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Feb 7, 2019

This enables callers to use sshrun with non-standard identity files.
For example, in the ReproMan project, we want to use DataLad commands
with an AWS EC2 instance. In this case, the host will not be in
.ssh/config, and the key files are usually not in the default
location.

@kyleam kyleam added the WIP work in progress label Feb 7, 2019
@kyleam
Copy link
Contributor Author

kyleam commented Feb 7, 2019

Doh, my attempt at testing is probably failing due to the connection caching. Will have to revisit.

@codecov
Copy link

codecov bot commented Feb 8, 2019

Codecov Report

Merging #3149 into master will decrease coverage by 44.79%.
The diff coverage is 38.88%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   89.47%   44.68%   -44.8%     
==========================================
  Files         248      248              
  Lines       32703    32717      +14     
==========================================
- Hits        29262    14619   -14643     
- Misses       3441    18098   +14657
Impacted Files Coverage Δ
datalad/support/sshrun.py 78.78% <ø> (-18.19%) ⬇️
datalad/support/tests/test_sshconnector.py 28.14% <33.33%> (-70.19%) ⬇️
datalad/support/sshconnector.py 59.51% <66.66%> (-25.15%) ⬇️
datalad/support/tests/utils.py 13.04% <0%> (-86.96%) ⬇️
datalad/support/tests/test_stats.py 13.11% <0%> (-86.89%) ⬇️
datalad/support/tests/test_repodates.py 13.46% <0%> (-86.54%) ⬇️
datalad/cmdline/tests/test_formatters.py 14.28% <0%> (-85.72%) ⬇️
datalad/tests/test_protocols.py 14.81% <0%> (-85.19%) ⬇️
datalad/tests/test_dochelpers.py 15.49% <0%> (-84.51%) ⬇️
datalad/tests/test_config.py 14.88% <0%> (-83.93%) ⬇️
... and 201 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca3225e...3a48742. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 8, 2019

Codecov Report

Merging #3149 into master will decrease coverage by 4.14%.
The diff coverage is 96.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   90.71%   86.56%   -4.15%     
==========================================
  Files         249      249              
  Lines       32723    32743      +20     
==========================================
- Hits        29684    28345    -1339     
- Misses       3039     4398    +1359
Impacted Files Coverage Δ
datalad/interface/common_cfg.py 100% <ø> (ø) ⬆️
datalad/support/sshrun.py 96.96% <ø> (ø) ⬆️
datalad/support/sshconnector.py 85.09% <100%> (+0.44%) ⬆️
datalad/support/tests/test_sshconnector.py 99.27% <95.45%> (+0.94%) ⬆️
datalad/tests/utils_testdatasets.py 11.76% <0%> (-88.24%) ⬇️
datalad/interface/tests/test_annotate_paths.py 24% <0%> (-76%) ⬇️
datalad/plugin/tests/test_addurls.py 59.92% <0%> (-40.08%) ⬇️
datalad/distribution/tests/test_get.py 60.72% <0%> (-39.28%) ⬇️
datalad/metadata/tests/test_base.py 61.37% <0%> (-37.94%) ⬇️
...atalad/interface/tests/test_add_archive_content.py 68.3% <0%> (-31.7%) ⬇️
... and 59 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81bfe17...93c8651. Read the comment docs.

@yarikoptic yarikoptic added this to the Release 0.11.3 milestone Feb 10, 2019
@kyleam kyleam force-pushed the ssh-identify-env branch 2 times, most recently from 511acbc to 3a48742 Compare February 11, 2019 20:01
@kyleam
Copy link
Contributor Author

kyleam commented Feb 11, 2019

Hmph, with the latest iteration (3a48742), I'm stumped on why test_ssh_manager_close fails with the addition of test_ssh_custom_identity_file. Seems like it'd have to be some sort of order/state dependence, but I haven't had luck figuring out what's going on.

@yarikoptic
Copy link
Member

Could it be due to all already existing control paths we established in other tests?

@kyleam
Copy link
Contributor Author

kyleam commented Feb 12, 2019

Could it be due to all already existing control paths we established in other tests?

Right, that was along the lines of what I was thinking when I wrote "some sort of order/state dependence". The problem is that, even if the new test were using the same socket (it's not) and if it weren't closing it (it is), I don't see how test_ssh_manager_close would fail the way that it does---saying that the path for the datalad-test socket doesn't exist.

There's Travis-specific setup for these tests, so I've been restricting my local debugging to test_ssh_custom_identity_file (with an adjustment to its ifile variable). I'm next going to try to get all the tests in test_sshconnector.py running locally to see if I can trigger the issue and step through the failing test_ssh_manager_close.

@kyleam
Copy link
Contributor Author

kyleam commented Feb 12, 2019

OK, so it looks like the issue is that, after patching os.environs DATALAD_SSH_IDENTITYFILE, the value leaks into the other tests because get_connection was only doing a cfg.reload(), not cfg.reload(force=True).

So I guess the main question is who should be responsible for calling reload(force=True), SSHManager or its creators? If the answer is SSHManager, then there is the follow-up question of where SSHManager should do it, get_connection or just at initialization (__init__ or assure_initialized)?

I'm leaning towards making SSHManager do it in assure_initialized because that function already imports ConfigManager (not datalad.cfg), so it essentially does a reload(force=True) and we can piggyback on it without an added expense. I'll do that unless someone argues otherwise.

This test was written at a time when the socket path basename was
simply the plain hostname.  The basename has been a hash of various
connection details since 5be560a (BF+ENH: Hash-based unique
ControlPaths for SSH (fixes dataladgh-1243), 2017-01-31).
Upcoming commits will make it possible to configure an identity file
to pass to ssh's -i.  Consider the identity file when creating the
connection hash because it's a defining feature of the connection.
The next commit will make this accessible to outside callers by
adjusting SSHManager and exposing a configuration variable.
This enables callers to use sshrun with non-standard identity files.
For example, in the ReproMan project, we want to use DataLad commands
with an AWS EC2 instance.  In this case, the host will not be in
.ssh/config, and the key files are usually not in the default
location.

A plain environment variable would do for the above usecase, but add
it as a variable in common_cfg.py for the visibility and
documentation.
@kyleam
Copy link
Contributor Author

kyleam commented Feb 12, 2019

I'm leaning towards making SSHManager do it in assure_initialized because [...]

Done with latest push.

range-diff
4:  c89249d29 = 1:  9cf3e0f2d TST+BF: sshconnector: Fix stale socket path construction
1:  f0377e599 = 2:  e9009eced ENH: ssh: Add identity_file parameter to get_connection_hash()
2:  30e7e18af = 3:  dc3226f5e ENH: SSHConnection: Support custom identity files
3:  8bf6a2ce9 ! 4:  93c86517b ENH: ssh: Support configuring an identity file
    @@ -32,21 +32,44 @@
      diff --git a/datalad/support/sshconnector.py b/datalad/support/sshconnector.py
      --- a/datalad/support/sshconnector.py
      +++ b/datalad/support/sshconnector.py
    +@@
    + class SSHManager(object):
    +     """Keeps ssh connections to share. Serves singleton representation
    +     per connection.
    ++
    ++    Note: If a connection should use a custom identity file via
    ++    `datalad.ssh.identityfile`, that value should be set before this class is
    ++    initialized.
    +     """
    + 
    +     def __init__(self):
    +@@
    +         # handling of socket_dir, so we do not define them here explicitly
    +         # to an empty list to fail if logic is violated
    +         self._prev_connections = None
    ++        self._identity_file = None
    +         # and no explicit initialization in the constructor
    +         # self.assure_initialized()
    + 
    +@@
    +         cfg = ConfigManager()
    +         self._socket_dir = opj(cfg.obtain('datalad.locations.cache'),
    +                                'sockets')
    ++        self._identity_file = cfg.get('datalad.ssh.identityfile')
    +         assure_dir(self._socket_dir)
    +         try:
    +             chmod(self._socket_dir, 0o700)
     @@
                  raise ValueError("Unsupported SSH URL: '{0}', use "
                                   "ssh://host/path or host:path syntax".format(url))
      
    -+        from datalad import cfg
    -+        # Reload because Python callers may have set environment variable after
    -+        # importing datalad. `force` needs to be true so that temporary
    -+        # environmental variables don't linger.
    -+        cfg.reload(force=True)
    ++        # Ensure self._identityfile has been set, if configured.
    ++        self.assure_initialized()
     +
    -+        identity_file = cfg.get('datalad.ssh.identityfile')
              conhash = get_connection_hash(
                  sshri.hostname,
                  port=sshri.port,
    -+            identity_file=identity_file or "",
    ++            identity_file=self._identity_file or "",
                  username=sshri.username)
              # determine control master:
              ctrl_path = "%s/%s" % (self.socket_dir, conhash)
    @@ -55,7 +78,8 @@
                  return self._connections[ctrl_path]
              else:
     -            c = SSHConnection(ctrl_path, sshri)
    -+            c = SSHConnection(ctrl_path, sshri, identity_file=identity_file)
    ++            c = SSHConnection(ctrl_path, sshri,
    ++                              identity_file=self._identity_file)
                  self._connections[ctrl_path] = c
                  return c
      

@kyleam kyleam removed the WIP work in progress label Feb 12, 2019
@yarikoptic yarikoptic merged commit cf3f899 into datalad:master Feb 15, 2019
@kyleam kyleam deleted the ssh-identify-env branch February 15, 2019 19:22
@kyleam
Copy link
Contributor Author

kyleam commented Feb 15, 2019

I'm leaning towards making SSHManager do it in assure_initialized

Eh after hooking up this latest approach to ReproMan, I'm not sure I made the right decision (I should have tested earlier). I was hoping that using assure_initialized would mean that the config reading would be delayed until datalad.ssh_managers first get_connection call (the only other spot where SSHManager calls assure_initialized is _socket_dir). But GitRepo.__init__() also calls assure_initialized. This means that if any external python callers want to set the environment variable to override this, they need to do it before an GitRepo initialization/use (e.g., using Dataset.id), not just before anything that might call get_connection.

So the options I see are

  • Call config.reload(force=True) for each get_connection() call. That's the expense that the above approach was trying to avoid, but I'm not sure how bad it is in practice.

  • Go back to the first iteration of this PR and just check os.environ, dropping the common_cfg entry.

  • Do nothing. It doesn't seem that bad to say the python callers just need to set up the environment early, but depending on how the program is structured and the task, it might be difficult or impossible for it to set the environment variables before it initializes a GitRepo. And I don't believe there is currently an easy way for the caller to explicitly ask ssh_manager to refresh its info.

@yarikoptic, do you have a preference or see another, better option?

@kyleam
Copy link
Contributor Author

kyleam commented Feb 15, 2019

Another option (currently my preference):

Adjust get_connection to look up the identity file with datalad.config.get (no .reload before). Document that if python callers have modified the config/environment after loading datalad and want to ensure that things are up to date, they should call datalad.config.reload(force=True).

@yarikoptic
Copy link
Member

yarikoptic commented Feb 15, 2019

... if any external python callers want to set the environment variable...

"external" -- aren't you are using datalad Python API in reproman? shouldn't it then just change config directly (from datalad import cfg)? Then assure_initialized could memoize on either it was initialized with the current setting of the datalad.ssh.identityfile, and if value changes - reinitialize. Then you could even jump between different identities within the same process. (that is the "solution" I had in mind when I was leaving)

@kyleam
Copy link
Contributor Author

kyleam commented Feb 15, 2019

"external" -- aren't you are using datalad Python API in reproman?

Yes.

shouldn't it then just change config directly (from datalad import cfg)?

Regardless of what my specific code does, I was under the assumption that we wanted to support Python callers setting environment variables. That's the reason for the whole reload(force=True) discussion in the first place, but that was probably a silly assumption on my part. I guess if that were the case, we'd need cfg.reload(force=True) sprinkled all over the codebase.

So I don't mind going in the direction you suggest, though I'd prefer the last option I suggested, which would accomplish the same thing without the need for SSHManager to have an _identity_file attribute.

@yarikoptic
Copy link
Member

whatever you like the best. I do not think though it is really bad to store _identity_file attribute to provide more of automated handling, but may be I do not see possible side-effects (e.g. if we initiate some connection with out the identity_file, then need to initiate another with the identityfile -- would initialization kill existing sockets/connections? didn't check the code)

@kyleam
Copy link
Contributor Author

kyleam commented Feb 15, 2019

I do not think though it is really bad to store _identity_file attribute to provide more of automated handling

I'm failing to see what the extra automation is. Either way you have a config.get call. By doing it in get_connection you do it closer to the only place the value is currently used, don't need the extra code for updating the attribute, and avoid changing the way the current assure_initialized works.

@kyleam kyleam mentioned this pull request Feb 15, 2019
3 tasks
@yarikoptic
Copy link
Member

yarikoptic commented Feb 16, 2019

Regarding your memory about needing it in environment variable... We might indeed need that (possibly to set it if not yet set while "treating config variable") because if we have a datalad special remote initiated by some annex call, it might be needed. But I don't think we have a specific use case for that yet I think.

kyleam added a commit to kyleam/datalad that referenced this pull request Feb 16, 2019
This reverts and replaces the sshconnector.py changes from
93c8651 (ENH: ssh: Support configuring an identity file,
2019-02-07).

93c8651 updated assure_initialized() to get datalad.ssh.identityfile
from its local ConfigManager instance and store that value as an
attribute for later use.  The motivation for retrieving the value in
assure_initialized() was to avoid the expense of calling
datalad.cfg.reload(force=True) with each get_connection() call.  In
turn, the motivation for using reload(force=True) was to prevent
test_ssh_custom_identity_file's patched DATALAD_SSH_IDENTITYFILE from
leaking into other tests [*].

The problem with this approach is that Python callers must set the
identity file before the first assure_initialized() call---which could
happen in SSHManager.get_connection() or _much earlier_ because
GitRepo.__init__() also calls assure_initialized().  After the first
assure_initialized() call, callers don't have a straightforward way of
changing the setting.

Instead let's use a plain datalad.cfg.get() call, without reloading,
within get_connection().  This gives Python callers the ability to
update the setting before _any_ get_connection() call, but they are
now responsible for reloading datalad.cfg if needed.

[*]: datalad#3149 (comment)
Re: datalad#3149
yarikoptic added a commit that referenced this pull request Apr 6, 2019
 ## 0.11.3 (Feb 19, 2019) -- read-me-gently

 Just a few of important fixes and minor enhancements.

 ### Fixes

 - The logic for setting the maximum command line length now works
   around Python 3.4 returning an unreasonably high value for
   `SC_ARG_MAX` on Debian systems. ([#3165])

 - DataLad commands that are conceptually "read-only", such as
   `datalad ls -L`, can fail when the caller lacks write permissions
   because git-annex tries merging remote git-annex branches to update
   information about availability. DataLad now disables
   `annex.merge-annex-branches` in some common "read-only" scenarios to
   avoid these failures. ([#3164])

 ### Enhancements and new features

 - Accessing an "unbound" dataset method now automatically imports the
   necessary module rather than requiring an explicit import from the
   Python caller. For example, calling `Dataset.add` no longer needs to
   be preceded by `from datalad.distribution.add import Add` or an
   import of `datalad.api`. ([#3156])

 - Configuring the new variable `datalad.ssh.identityfile` instructs
   DataLad to pass a value to the `-i` option of `ssh`. ([#3149])
   ([#3168])

* tag '0.11.3': (35 commits)
  ENH: Finalized changelog entry for 0.11.3
  [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.3 entries
  CHANGELOG: Mention #3168
  ENH: sshconnector: Get identity file with datalad.cfg.get()
  [DATALAD RUNCMD] CHANGELOG: Linkify 0.11.3 entries
  CHANGELOG: Add entries for 0.11.3
  BF/RF(TST): skip test if actual sudo chown call fails
  BF(TMP): declare check_datasets_datalad_org failing on windows
  BF(TST): replace not relevant trailing .pull test with .repo_info
  BF(BK): for some reason an exception on repo_info invocation isn't raised while on travis
  TST: test_ro_operations  via sudo (when possible)
  BF: allow to disallow git-annex branch merges for repo_info and use that in ls
  BF(unicode): use assure_unicode while formatting an exception
  BF(PY3): workaround for python3.4 on debian returning obnoxious SC_ARG_MAX
  BF: revert change for repo_info about not merging remote git-annex
  BF: set annex.merge-annex-branches=false for invocations not requiring updated remote information
  ENH: ssh: Support configuring an identity file
  ENH: SSHConnection: Support custom identity files
  ENH: ssh: Add identity_file parameter to get_connection_hash()
  TST+BF: sshconnector: Fix stale socket path construction
  ...
yarikoptic added a commit that referenced this pull request Apr 6, 2019
 ## 0.11.3 (Feb 19, 2019) -- read-me-gently

 Just a few of important fixes and minor enhancements.

 ### Fixes

 - The logic for setting the maximum command line length now works
   around Python 3.4 returning an unreasonably high value for
   `SC_ARG_MAX` on Debian systems. ([#3165])

 - DataLad commands that are conceptually "read-only", such as
   `datalad ls -L`, can fail when the caller lacks write permissions
   because git-annex tries merging remote git-annex branches to update
   information about availability. DataLad now disables
   `annex.merge-annex-branches` in some common "read-only" scenarios to
   avoid these failures. ([#3164])

 ### Enhancements and new features

 - Accessing an "unbound" dataset method now automatically imports the
   necessary module rather than requiring an explicit import from the
   Python caller. For example, calling `Dataset.add` no longer needs to
   be preceded by `from datalad.distribution.add import Add` or an
   import of `datalad.api`. ([#3156])

 - Configuring the new variable `datalad.ssh.identityfile` instructs
   DataLad to pass a value to the `-i` option of `ssh`. ([#3149])
   ([#3168])

* tag '0.11.3':
  BF(TST): skip if windows OR root; catch any exception
yarikoptic added a commit that referenced this pull request Apr 6, 2019
 ## 0.11.3 (Feb 19, 2019) -- read-me-gently

 Just a few of important fixes and minor enhancements.

 ### Fixes

 - The logic for setting the maximum command line length now works
   around Python 3.4 returning an unreasonably high value for
   `SC_ARG_MAX` on Debian systems. ([#3165])

 - DataLad commands that are conceptually "read-only", such as
   `datalad ls -L`, can fail when the caller lacks write permissions
   because git-annex tries merging remote git-annex branches to update
   information about availability. DataLad now disables
   `annex.merge-annex-branches` in some common "read-only" scenarios to
   avoid these failures. ([#3164])

 ### Enhancements and new features

 - Accessing an "unbound" dataset method now automatically imports the
   necessary module rather than requiring an explicit import from the
   Python caller. For example, calling `Dataset.add` no longer needs to
   be preceded by `from datalad.distribution.add import Add` or an
   import of `datalad.api`. ([#3156])

 - Configuring the new variable `datalad.ssh.identityfile` instructs
   DataLad to pass a value to the `-i` option of `ssh`. ([#3149])
   ([#3168])

* tag '0.11.3':
  Boost version to 0.11.3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants